##Model Introduction
A probit regression is a generalized linear model used to model dichotomous outcome variables. It links the success probability to a linear combination of the predictors through the inverse of the standard normal cumulative distribution function. The binary outcome variable \(Y_i\) is assumed to have a Bernoulli distribution with success probability \(p_i \in (0,1)\). Hence, the probit link function is \[\text{probit}(E[Y_i]) = \Phi^{-1}(p_i) = \Phi^{-1}(P[Y_i=1]) = \sum_{k=0}^{n}\beta_k x_{ik},\] where \[\Phi(z)=\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} \exp\left(-\frac{1}{2}t^2\right)dt\] is the standard normal CDF.
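To make the link concrete, \(\Phi\) and its inverse (the probit function) can be evaluated numerically. A small sketch using SciPy (the probability value 0.75 is arbitrary):

```python
from scipy.stats import norm

p = 0.75            # an arbitrary success probability
z = norm.ppf(p)     # probit: the inverse standard normal CDF, Phi^{-1}(p) ~ 0.674
back = norm.cdf(z)  # Phi maps the linear predictor back to a probability
print(z, back)      # applying cdf after ppf recovers p = 0.75
```

In the regression, `z` plays the role of the linear predictor \(\sum_k \beta_k x_{ik}\), and `norm.cdf` converts it back to a probability.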
##Dataset: Female Labor Participation
In this tutorial, we work with the Mroz data set on female labor-force participation, which has 8 variables. The data cover a sample of 753 married white women aged between 30 and 60, collected in 1975. The original source for these data is Mroz, T.A. (1987). “The sensitivity of an empirical model of married women’s hours of work to economic and statistical assumptions.” Econometrica 55, 765-799. The subset of the original data set used in this study can be found here. The variables are described below.
| Variable(s) | Description | Type |
|---|---|---|
| lfp | Labor-force participation of the married white woman | Categorical: 0/1 |
| k5 | Number of children younger than 6 years old | Non-negative integer |
| k618 | Number of children aged 6-18 | Non-negative integer |
| age | Age in years | Positive integer |
| wc | Wife's college attendance | Categorical: 0/1 |
| hc | Husband's college attendance | Categorical: 0/1 |
| lwg | Log expected wage rate for women in the labor force | Numerical |
| inc | Family income without the wife's income | Numerical |
The goal of the tutorial is to identify whether certain characteristics of a woman’s household and personal life can predict her labor-force participation.
##Languages {.tabset}
###R
First, we load the data into R using read.csv, since the data are in comma-separated format.

```r
mroz = read.csv('./mroz.csv')
```
This is what the first six rows look like in R:
```r
head(mroz)
##   X lfp k5 k618 age  wc hc       lwg    inc
## 1 1 yes  1    0  32  no no 1.2101647 10.910
## 2 2 yes  0    2  30  no no 0.3285041 19.500
## 3 3 yes  1    3  35  no no 1.5141279 12.040
## 4 4 yes  0    3  34  no no 0.0921151  6.800
## 5 5 yes  1    2  31 yes no 1.5242802 20.100
## 6 6 yes  0    0  54  no no 1.5564855  9.859
```
Then, we recode the binary variables to numeric 0/1 and convert them to factors.

```r
mroz$lfp = ifelse(mroz$lfp=="yes", 1, 0)
mroz$wc = ifelse(mroz$wc=="yes", 1, 0)
mroz$hc = ifelse(mroz$hc=="yes", 1, 0)
mroz$lfp = factor(mroz$lfp)
mroz$wc = factor(mroz$wc)
mroz$hc = factor(mroz$hc)
```
Here is a quick summary of the data.

```r
summary(mroz)
##        X       lfp          k5              k618            age       
##  Min.   :  1   0:325   Min.   :0.0000   Min.   :0.000   Min.   :30.00  
##  1st Qu.:189   1:428   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:36.00  
##  Median :377           Median :0.0000   Median :1.000   Median :43.00  
##  Mean   :377           Mean   :0.2377   Mean   :1.353   Mean   :42.54  
##  3rd Qu.:565           3rd Qu.:0.0000   3rd Qu.:2.000   3rd Qu.:49.00  
##  Max.   :753           Max.   :3.0000   Max.   :8.000   Max.   :60.00  
##  wc      hc           lwg               inc        
##  0:541   0:458   Min.   :-2.0541   Min.   :-0.029  
##  1:212   1:295   1st Qu.: 0.8181   1st Qu.:13.025  
##                  Median : 1.0684   Median :17.700  
##                  Mean   : 1.0971   Mean   :20.129  
##                  3rd Qu.: 1.3997   3rd Qu.:24.466  
##                  Max.   : 3.2189   Max.   :96.000  
```
Next, we run a probit regression using lfp as a response variable and all the remaining variables as predictors.
```r
mroz.probit <- glm(lfp ~ k5 + k618 + age + wc + hc + lwg + inc,
                   data = mroz,
                   family = binomial(link = "probit"))
summary(mroz.probit)
## 
## Call:
## glm(formula = lfp ~ k5 + k618 + age + wc + hc + lwg + inc, family = binomial(link = "probit"), 
##     data = mroz)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1359  -1.1024   0.5967   0.9746   2.2236  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.918418   0.382356   5.017 5.24e-07 ***
## k5          -0.874712   0.114423  -7.645 2.10e-14 ***
## k618        -0.038595   0.040950  -0.942 0.345942    
## age         -0.037824   0.007605  -4.973 6.58e-07 ***
## wc1          0.488310   0.136731   3.571 0.000355 ***
## hc1          0.057172   0.124207   0.460 0.645306    
## lwg          0.365635   0.089992   4.063 4.85e-05 ***
## inc         -0.020525   0.004852  -4.230 2.34e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1029.75  on 752  degrees of freedom
## Residual deviance:  905.39  on 745  degrees of freedom
## AIC: 921.39
## 
## Number of Fisher Scoring iterations: 4
```
Next, we cross-tabulate each categorical predictor with the binary response variable.
```r
table(mroz$lfp, mroz$hc)
##    
##       0   1
##   0 207 118
##   1 251 177
table(mroz$lfp, mroz$wc)
##    
##       0   1
##   0 257  68
##   1 284 144
```
Finally, we use the effects package to plot the effect of each predictor on the response probability.

```r
library(effects)
all.effects <- allEffects(mod = mroz.probit)
plot(all.effects, type="response", ylim=c(0,1))
```
###Python
####Loading Data
The data set has the binary response variable lfp, which stands for labor-force participation.
First, we load the data into memory using the pandas package, along with the other packages used in this analysis.
```python
import pandas as pd
import numpy as np
from statsmodels.discrete.discrete_model import Probit

data = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/carData/Mroz.csv')
print(data.head())
```
The first five rows look like this:
```
   Unnamed: 0  lfp  k5  k618  age   wc  hc       lwg        inc
0           1  yes   1     0   32   no  no  1.210165  10.910001
1           2  yes   0     2   30   no  no  0.328504  19.500000
2           3  yes   1     3   35   no  no  1.514128  12.039999
3           4  yes   0     3   34   no  no  0.092115   6.800000
4           5  yes   1     2   31  yes  no  1.524280  20.100000
```
####Data Cleaning
Since some columns are read in as strings, we transform them into binary categorical data with the following commands. We also drop the first column, which holds row numbers that we do not need.

```python
data = data.drop(data.columns[0], axis = 1)
data["lfp"] = data["lfp"] == "yes"
data["wc"] = data["wc"] == "yes"
data["hc"] = data["hc"] == "yes"
```
Looking at the data again we see:
```
    lfp  k5  k618  age     wc     hc       lwg        inc
0  True   1     0   32  False  False  1.210165  10.910001
1  True   0     2   30  False  False  0.328504  19.500000
2  True   1     3   35  False  False  1.514128  12.039999
3  True   0     3   34  False  False  0.092115   6.800000
4  True   1     2   31   True  False  1.524280  20.100000
```
####Summary Statistics
To generate summary statistics, we can use the describe method on the data frame.

```python
print(data.describe())
```
This generates summary statistics for the continuous variables in our dataset.
```
               k5        k618         age         lwg         inc
count  753.000000  753.000000  753.000000  753.000000  753.000000
mean     0.237716    1.353254   42.537849    1.097115   20.128965
std      0.523959    1.319874    8.072574    0.587556   11.634799
min      0.000000    0.000000   30.000000   -2.054124   -0.029000
25%      0.000000    0.000000   36.000000    0.818086   13.025000
50%      0.000000    1.000000   43.000000    1.068403   17.700001
75%      0.000000    2.000000   49.000000    1.399717   24.466000
max      3.000000    8.000000   60.000000    3.218876   96.000000
```
####Fitting Regression
First, we split the data set into the response variable and the predictor variables. Then we use statsmodels to fit the probit regression. Note that, unlike R's glm, statsmodels' Probit does not add an intercept automatically (statsmodels.api.add_constant would include one), so the estimates below differ from a fit with a constant.

```python
Y = data["lfp"]
X = data.drop(["lfp"], axis=1)
model = Probit(Y, X.astype(float))
result = model.fit()
print(result.summary())
```
The following are the results of our regression:
```
Optimization terminated successfully.
         Current function value: 0.618620
         Iterations 5
                          Probit Regression Results                           
==============================================================================
Dep. Variable:                    lfp   No. Observations:                  753
Model:                         Probit   Df Residuals:                      746
Method:                           MLE   Df Model:                            6
Date:                Mon, 26 Nov 2018   Pseudo R-squ.:                 0.09527
Time:                        23:16:26   Log-Likelihood:                -465.82
converged:                       True   LL-Null:                       -514.87
                                        LLR p-value:                 6.234e-19
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
k5            -0.6136      0.098     -6.249      0.000      -0.806      -0.421
k618           0.0674      0.034      1.965      0.049       0.000       0.135
age           -0.0021      0.003     -0.775      0.439      -0.007       0.003
wc             0.4497      0.133      3.372      0.001       0.188       0.711
hc             0.1267      0.122      1.040      0.298      -0.112       0.365
lwg            0.4632      0.084      5.486      0.000       0.298       0.629
inc           -0.0187      0.005     -3.983      0.000      -0.028      -0.010
==============================================================================
```
###Stata
First, we import the Mroz data from the web and show the first six rows of the data set.

```stata
*Importing data
import delimited https://vincentarelbundock.github.io/Rdatasets/csv/carData/Mroz.csv, clear
save mroz, replace
use mroz, clear
*List the first six rows
list if v1<=6
```
Then, we change all binary variables to numeric and get a summary of the data. Our response is lfp and its mean is 0.57. Age ranges from 30 to 60.

```stata
*Change variables with values yes/no to 1/0
gen lfpart = 1 if lfp == "yes"
replace lfpart = 0 if lfp == "no"
gen wifec = 1 if wc == "yes"
replace wifec = 0 if wc == "no"
gen husbc = 1 if hc == "yes"
replace husbc = 0 if hc == "no"
drop lfp wc hc
rename lfpart lfp
rename wifec wc
rename husbc hc
*Get the summary of the data
summ
```
Now, we fit a probit regression with lfp as the response and the remaining variables as predictors. Judging from the p-values, all variables are highly significant except k618 and hc.

```stata
*Fitting the data by probit regression
probit lfp k5 k618 age lwg inc i.wc i.hc
```
We summarize the predicted probabilities from the fitted model: the smallest is 0.0057 and the largest is 0.9745. The median is 0.5782, which is close to the mean of lfp shown above.

```stata
*Predicting the probability of labor-force participation
predict prob_lfp
summ prob_lfp, detail
```
Now, we predict the response for groups defined by levels of the categorical variables.
#####Group by hc
First, we make a frequency table of hc and lfp. Then we predict lfp for the two groups hc=0 and hc=1, keeping the other variables at their means.

```stata
tab lfp hc
*use margins for each level of hc
margins hc, atmeans
```
The marginal probability for hc=1 (husband attended college) is 0.59, only slightly higher than the marginal probability for hc=0 (husband did not attend college), which is 0.57. There is no obvious difference, which is consistent with the large p-value of hc.
#####Group by wc
The frequency table shows that when wc=0, the proportion with lfp=1 is close to 0.5. However, when wc=1, the proportion with lfp=1 is much higher.

```stata
tab lfp wc
```

We predict lfp for the two groups wc=0 and wc=1, keeping the other variables at their means.

```stata
*use margins for each level of wc
margins wc, atmeans
```
The result shows that the marginal probability is 0.71 when wc=1 and 0.52 when wc=0. The probability of participating in the labor force is higher when the wife has attended college, so we can say that the wife's college attendance is an important predictor.
We can look more closely at the predictor wc by predicting lfp jointly by age and wc, with age taken at every 10 years from 30 to 60. Since the margins output is long, we make a plot to visualize it, which is easier to interpret.

```stata
*use margins for each level of wc and age
margins, at(age=(30(10)60) wc=(0 1)) atmeans vsquish
marginsplot
```
From the margins plot, we can conclude that the probability decreases as age increases. Also, the probability for wc=1 is always higher than for wc=0. At age 60 the variability is highest, since the 95% confidence interval is widest there.
The frequency table shows that the proportion with lfp=1 decreases as k5 increases.

```stata
tab lfp k5
```

We predict lfp at k5 = 0, 1, 2, 3, keeping the other variables at their means, and again plot the result.

```stata
*use margins for each level of k5
margins, at(k5=(0 1 2 3)) atmeans
marginsplot
```
The output shows that when a woman has no children five years old or younger, the probability of participating in the labor force is 0.66, which is higher than the average. The probability decreases with each additional young child, so we can conclude that k5 is an important predictor.
###SAS
In the original mroz.csv data set, three variables are binary character ("yes"/"no"): lfp, wc, and hc. In this part, we use a processed data file in which the values of these three variables have been converted from "yes"/"no" to numeric 1/0.

```sas
/*Importing data*/
proc import datafile = 'C:\Users\lanshi\Desktop\Mroz.csv'
    out = work.mroz
    dbms = CSV
    ;
run;
```
Display summary tables of the data set (the 3 binary variables are handled separately from the others).

```sas
/*For the other non-binary variables*/
proc means data=mroz;
    var k5 k618 age lwg inc;
run;

/*For the 3 binary variables, consider individual frequencies and interactions between the response and wc/hc.*/
proc freq data=mroz;
    tables lfp wc hc lfp*wc lfp*hc;
run;
```
Now, we fit a probit regression with lfp as the response and the remaining variables as predictors. By adding the argument "descending", we model the 1s rather than the 0s, i.e., we predict the probability of a woman being in the labor force (lfp=1) rather than out of it (lfp=0).

```sas
/*Fitting the data by probit regression*/
proc logistic data=mroz descending;
    class lfp wc hc / param=ref;
    model lfp = k5 k618 age lwg inc wc hc / link=probit;
run;
```
Here are the results we get from the regression:
The result of the test of the Global Null Hypothesis (by the Likelihood Ratio, Score, and Wald tests) indicates that the model is statistically significant.
As for the individual variables, the tables above show that k618 and hc both have p-values greater than 0.05, so at significance level 0.05 they are not statistically significant. The reason for k618 might be that children aged 6-18 are much more independent than children under 6, so their mother has time to work. Regarding hc, whether the husband has attended college does not really affect his wife's decision on whether to be in the labor force.
Detailed interpretations (the probit regression coefficients give the change in the probit index, also called a z-score, for a one-unit increase in the predictor):

- k5: each additional child under 6 years old decreases the z-score by 0.8747.
- k618: each additional child aged 6-18 decreases the z-score by 0.0386.
- age: a one-year increase in a woman's age decreases the z-score by 0.0378.
- lwg: a one-unit increase in log(wage) increases the z-score by 0.3656.
- inc: a one-unit increase in (family income - wage*hours)/1000 decreases the z-score by 0.0205.
- wc: the wife's having attended college increases the z-score by 0.4883.
- hc: the husband's having attended college increases the z-score by 0.0572.
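As a numerical illustration of these interpretations, the fitted z-score for a given profile can be converted to a probability with \(\Phi\). A sketch in Python using the coefficient estimates reported above (the profile values are hypothetical):

```python
from scipy.stats import norm

# Coefficient estimates from the probit fit above
coef = {"intercept": 1.9184, "k5": -0.8747, "k618": -0.0386, "age": -0.0378,
        "wc": 0.4883, "hc": 0.0572, "lwg": 0.3656, "inc": -0.0205}

# Hypothetical profile: a 35-year-old college-educated woman with one child
# under 6, no children aged 6-18, lwg = 1.1, and family income 20
x = {"k5": 1, "k618": 0, "age": 35, "wc": 1, "hc": 0, "lwg": 1.1, "inc": 20}

z = coef["intercept"] + sum(coef[k] * v for k, v in x.items())
p = norm.cdf(z)   # predicted probability of labor-force participation
print(round(z, 3), round(p, 3))
```

Changing k5 from 1 to 2 in this profile lowers the z-score by 0.8747 and, through `norm.cdf`, the predicted probability.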
Now, we predict the response for groups defined by levels of the categorical variables, keeping the other variables at their means.

```sas
/*Refit the probit regression, requesting effect plots and predicted probabilities*/
proc logistic data=mroz descending plots=EFFECT;
    class lfp wc hc / param=ref;
    model lfp = k5 k618 age lwg inc wc hc / link=probit;
    output out=estimated predicted=estprob l=lower95 u=upper95;
run;
```
The code above also creates several plots for model diagnostics.
From these diagnostic plots, we can see that the regression model's performance is reasonable.